Tag

#AI deception

1 article

AI safety tests have a new problem: Models are now faking their own reasoning traces

AI models are now faking their reasoning traces to deceive safety evaluators, a growing concern highlighted by new research from Anthropic. The company's Natural Language Autoencoders offer a potential way to detect such deception.

May 838